[WIP] refactor of dataset builder and executor #537

Open

wants to merge 57 commits into main

Conversation

cyruszhang (Collaborator)

Key elements of this PR:

  1. YAML explicitly defines the different dataset sources; local and remote datasets are defined separately (see the illustrative config sketch after this list)
  2. More flexible and open parameterized control of datasets; different parameters and the corresponding validation are supported per source, and hooks are left open for additional / fine-grained configuration
  3. Unbind the Executor's hardcoded dataset support (currently RayExecutor only accepts local JSON files, and this binding is hardcoded in the code); Executor/RayExecutor is no longer tied to a dataset input format, and whether a dataset can be loaded is decided by the formatter/downloader's declared support for the executor type
  4. Improve the extensibility of the Executor framework, making it easier to support other engines such as NeMo, Dask, and Spark
  5. Support dataset format validation
  6. Additional data source support
    a. Support ModelScope
    b. Support arXiv: download, decompress, and ingest
    c. Support Wikipedia: download, decompress, and ingest
    d. Support Common Crawl: download, decompress, and ingest
  7. Compatible with the current command-line dataset_path format
  8. Compatible with data mixture
  9. Compatible with the empty_formatter/generated_dataset_config path
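
A minimal sketch of what such an explicit YAML definition could look like; every field name below is an illustrative assumption rather than the schema introduced by this PR:

# illustrative sketch only; field names are assumptions, not the final schema
dataset:
  configs:
    # local source: path-based, with format-specific parameters
    - type: local
      path: path/to/local/data.jsonl
    # remote source: provider-specific parameters and validation
    - type: remote
      source: huggingface        # or modelscope / arxiv / wiki / commoncrawl
      path: org_name/dataset_name
      split: train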

design doc: https://aliyuque.antfin.com/yilei.z/cnk4dn/qomvqql62lyglrh2?singleDoc# "Refactoring Design of Dataset/Loader/Executor"

@cyruszhang cyruszhang removed the request for review from drcege February 7, 2025 20:39

# Validate conversation structure
for item in dataset:
    turns = self._parse_turns(item['text'])
Collaborator:

These classes are still in progress, right? Do they need to be updated or implemented later?

Collaborator:

The dataset format for conversations can be found here.

MAX_SAMPLE_SIZE = 1000
if isinstance(dataset, NestedDataset):
    sample_size = min(MAX_SAMPLE_SIZE, len(dataset))
    sample = dataset.select(range(sample_size))
Collaborator:

For an HF dataset, we can use the dataset.take(n) method to get the top-n samples more efficiently. Related doc
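
A rough sketch of that suggestion (hedged: Dataset.take is only available in newer versions of the datasets library, so a fallback to select is kept):

# sketch of the suggestion above; falls back to select() when take() is absent
sample_size = min(MAX_SAMPLE_SIZE, len(dataset))
if hasattr(dataset, 'take'):
    sample = dataset.take(sample_size)        # avoids materializing an index range
else:
    sample = dataset.select(range(sample_size))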

}

def load_data(self, **kwargs):
    dataset = rd.read_json(self.ds_config['path'])
Collaborator:

Use RayDataset.read_json() instead to support stream reading for JSON files. Ref:

@classmethod
def read_json(cls, paths: Union[str, List[str]]) -> RayDataset:
    # Note: a temp solution for reading json stream
    # TODO: replace with ray.data.read_json_stream once it is available
    import pyarrow.json as js
    try:
        js.open_json
        return read_json_stream(paths)
    except AttributeError:
        return rd.read_json(paths)
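
If adopted, the Ray JSON load strategy above could delegate to that classmethod; a rough sketch of the change (the surrounding wiring is assumed, not taken from this PR):

# rough sketch only; RayDataset refers to the class whose read_json classmethod
# is quoted above, and how it is imported here is an assumption
def load_data(self, **kwargs):
    return RayDataset.read_json(self.ds_config['path'])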


def load_data(self, **kwargs):
    raise NotImplementedError(
        'Huggingface data load strategy is not implemented')
Collaborator:

'Huggingface data load strategy for Ray is not implemented'

@@ -86,7 +36,8 @@ def __init__(self,
                  dataset: rd.Dataset,
                  dataset_path: str = None,
                  cfg=None) -> None:
-        self.data = preprocess_dataset(dataset, dataset_path, cfg)
+        self.data = dataset
+        # self.data = preprocess_dataset(dataset, dataset_path, cfg)
Collaborator:

Is preprocess_dataset necessary? @pan-x-c

import pandas as pd
import regex as re
import requests
from bs4 import BeautifulSoup
Collaborator:

Add bs4 to the minimal requirements.


# The iterator and extractor code are in large part taken
# from the Red-Pajama repo
# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv
Collaborator:

# implementation of the Wikipedia dataset preparation:
# https://github.com/huggingface/datasets/blob/7e30308f49f8c85dc7a2ab5aafbff04b5d2f38e2/datasets/wikipedia/wikipedia.py

MEDIA_ALIASES = {
Collaborator:

Why not import them from datasets?

WORK_DIR = os.path.dirname(os.path.realpath(__file__))


@SKIPPED_TESTS.register_module()
Collaborator:

Add a comment describing the reason for skipping this test.
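
For example, something along these lines; the skip reason given here is only a placeholder assumption:

# Skipped in the default unit-test run because it downloads remote datasets and
# is too slow for CI (placeholder reason; replace with the actual rationale).
@SKIPPED_TESTS.register_module()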



def test_rewrite_cli_datapath_local_single_file(self):
    dataset_path = "./data/sample.txt"
Collaborator:

Building the path from the current file's directory (WORK_DIR) makes it easier for readers to trace. Ref:

data_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..',
                         'data')
aud1_path = os.path.join(data_path, 'audio1.wav')  # about 6s
aud2_path = os.path.join(data_path, 'audio2.wav')  # about 14s
aud3_path = os.path.join(data_path, 'audio3.ogg')  # about 1min59s
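
Applied to the test above, the suggestion would look roughly like this (the sample file's location relative to WORK_DIR is an assumption):

# sketch of the suggestion; assumes WORK_DIR points at the test file's directory
def test_rewrite_cli_datapath_local_single_file(self):
    dataset_path = os.path.join(WORK_DIR, 'data', 'sample.txt')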

Labels: dj:core (issues/PRs about the core functions of Data-Juicer), dj:dataset (issues/PRs about the dj-dataset), enhancement (new feature or request)